There are multiple reasons for analyzing a version control system like your Git repository. See, for example, Adam Tornhill's book "Your Code as a Crime Scene" or his upcoming book "Software Design X-Rays" for plenty of inspiration: you can spot change hotspots in your codebase or find out how knowledge is distributed across your development team, for instance.
Having the necessary data for those analyses in a Pandas DataFrame gives you many possibilities to quickly gain insights into the evolution of your software system.
In a preceding blog post, I showed you a way to read a Git log file into a Pandas DataFrame with GitPython. Looking back, this was really complicated and tedious. So this time, with a few tricks, we can do much better:
The first step is to connect GitPython with the Git repo. Once we have an instance of the repo, we can access the underlying Git installation of the operating system via repo.git.
In our case, we tap the Spring PetClinic repo, a small sample application for the Spring framework (I also analyzed the huge Linux repo; this works just as well).
In [1]:
import git

# connect GitPython with the local repository
GIT_REPO_PATH = r'../../spring-petclinic/'
repo = git.Repo(GIT_REPO_PATH)

# gain access to the underlying Git installation of the OS
git_bin = repo.git
git_bin
Out[1]:
With git_bin, we can execute almost any Git command we like directly. In our hypothetical use case, we want to retrieve some information about the change frequency of files. For this, we need the complete history of the Git repo, including statistics for the changed files (via --numstat).
We use a little trick to make sure that the format of the files' statistics fits nicely with the commits' metadata (abbreviated SHA %h, UNIX timestamp %at, and author's name %aN). The --numstat option lists the additions, the deletions, and the affected file name in one line, separated by the tab character \t:
1\t1\tsome/file/name.ext
We use the same tab separator \t in the format string:
%h\t%at\t%aN
And here is the trick: we prepend the same number of tabs as in a file's statistics line, plus one more as separator, to the format string. This pretends that there is an empty file statistics entry in front of each commit's metadata.
The result looks like this:
\t\t\t%h\t%at\t%aN
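In the exported log, each commit's metadata line is then followed by the numstat lines of its changed files. With placeholder values for SHA, timestamp, and author, a commit would look roughly like this:
\t\t\tabc1234\t1511279431\tSome Author
2\t3\tsome/file/name.ext
1\t1\tanother/file/name.ext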
Note: If you want to export the Git log into a file on the command line, you need to use the horizontal tab's hex code %x09 instead of \t in the format string. Otherwise, the trick doesn't work (I'll show the corresponding format string at the end of this article).
OK, let's execute the Git log export:
In [2]:
git_log = git_bin.execute('git log --numstat --pretty=format:"\t\t\t%h\t%at\t%aN"')
git_log[:80]
Out[2]:
We've now read the complete file history into the git_log variable. Don't let all the \t characters confuse you.
Let's read the result into a Pandas DataFrame by using the read_csv method. Because we have the CSV data in memory instead of in a file, we use StringIO to wrap our in-memory content in a readable buffer.
Pandas reads the first line of the tab-separated "file", sees the many tab-separated columns, and parses all other lines with the same column layout. Additionally, we set header to None because we don't have a header row and provide descriptive names for all the columns that we read in.
In [3]:
import pandas as pd
from io import StringIO
commits_raw = pd.read_csv(StringIO(git_log),
                          sep="\t",
                          header=None,
                          names=['additions', 'deletions', 'filename', 'sha', 'timestamp', 'author'])
commits_raw.head()
Out[3]:
Now we have two different kinds of rows: rows that contain only a commit's metadata (the file statistics columns are empty) and rows that contain only a file's statistics (the commit metadata columns are empty).
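If you want to verify this for yourself, you could filter on the null values (a quick inspection snippet, not part of the actual pipeline):
# commit metadata rows: the file statistics columns are NaN
commits_raw[commits_raw['additions'].isnull()].head()

# file statistics rows: the commit metadata columns are NaN
commits_raw[commits_raw['sha'].isnull()].head()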
But we are interested in the commit metadata for each file's statistics. For this, we forward fill (ffill) the empty commit metadata entries of the file statistics rows with the preceding commit's metadata via the DataFrame's fillna method and join this data with the existing columns of the file statistics.
In [4]:
commits = commits_raw[['additions', 'deletions', 'filename']]\
    .join(commits_raw[['sha', 'timestamp', 'author']].fillna(method='ffill'))
commits.head()
Out[4]:
This gives us the commit metadata for each file change!
Because we aren't interested in the pure commit metadata rows anymore, we drop all rows that don't contain file statistics, i.e., that still contain null values, via dropna.
In [5]:
commits = commits.dropna()
commits.head()
Out[5]:
And that's it! We are finished!
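With the data in this shape, the change frequency question from our hypothetical use case above is just one more line. A sketch, using the commits DataFrame from above:
# count how often each file was changed over the whole history
commits['filename'].value_counts().head(10)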
In summary, you just need a "one-liner" for converting the Git log output that was exported with
git log --numstat --pretty=format:"%x09%x09%x09%h%x09%at%x09%aN" > git.log
and read it into a DataFrame:
In [6]:
# reading
git_log = pd.read_csv(
    "../../spring-petclinic/git.log",
    sep="\t",
    header=None,
    names=[
        'additions',
        'deletions',
        'filename',
        'sha',
        'timestamp',
        'author'])

# converting in "one line"
git_log[['additions', 'deletions', 'filename']]\
    .join(git_log[['sha', 'timestamp', 'author']]
          .fillna(method='ffill'))\
    .dropna().head()
Out[6]:
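From here on, everything is standard Pandas. If you prefer real datetimes over the raw UNIX timestamps that %at exports, one possible next step (again using the commits DataFrame from above) is:
# convert the UNIX epoch seconds into proper datetime values
commits['timestamp'] = pd.to_datetime(commits['timestamp'], unit='s')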